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ABSTRACT 

Enhancer elements are essential for tissue-specific 
gene regulation during mammalian development. 
Although these regulatory elements are often 
distant from their target genes, they affect gene ex- 
pression by recruiting transcription factors to 
specific promoter regions. Because of this long- 
range action, the annotation of enhancer element- 
target promoter pairs remains elusive. Here, we 
developed a novel analysis methodology that takes 
advantage of Hi-C data to comprehensively identify 
these interactions throughout the human genome. 
To do this, we used a geometric distribution-based 
model to identify DNA-DNA interaction hotspots 
that contact gene promoters with high confidence. 
We observed that these promoter-interacting 
hotspots significantly overlap with known 
enhancer-associated histone modifications and 
DNase I hypersensitive sites. Thus, we defined thou- 
sands of candidate enhancer elements by 
incorporating these features, and found that they 
have a significant propensity to be bound by p300, 
an enhancer binding transcription factor. 
Furthermore, we revealed that their target genes 
are significantly bound by RNA Polymerase II and 
demonstrate tissue-specific expression. Finally, we 
uncovered that these elements are generally found 
within 1 Mb of their targets, and often regulate 
multiple genes. In total, our study presents a novel 
high-throughput workflow for confident, genome- 
wide discovery of enhancer-target promoter pairs, 
which will significantly improve our understanding 
of these regulatory interactions. 



INTRODUCTION 

While the Human Genome Project was declared complete 
in 2003, many regulatory elements still remain undefined. 
Enhancers are one such class of elements because true 
definition of an enhancer requires identification of both 
the regulatory sequence and its interacting promoter 
region(s). Enhancer-target identification is further 
complicated by the fact that they interact in an 
orientation-independent manner, can be milhons of base 
pairs away from each other or even reside on different 
chromosomes (1-3). Enhancer elements also have 
dynamic regulatory activities under various developmen- 
tal and environmental conditions. For instance, they can 
activate gene expression in a tissue- and temporal-specific 
manner. Thus, they affect different sets of genes in differ- 
ent tissues (4) and/or play variable regulatory roles during 
animal development (5,6). One well-studied example of 
this dynamic property is the locus-control region (LCR) 
that regulates the cluster of five human p-type globin 
genes on If pi 5.4 (6). These globin genes are exclusively 
expressed in erythroid cells and are expressed differentially 
in fetal and adult cells mediated by the LCR that is located 
about 40 kb upstream. 

Recent studies reveal that in eukaryotes, histone modi- 
fications such as histone 3 lysine 27 acetylation (H3K27ac), 
histone 3 lysine 4 mono-methylation (H3K4mel), di- 
methylation (H3K4me2) and tri-methylation (H3K4me3) 
can play crucial roles in the activation of enhancer 
elements under different environmental conditions, cell 
lineages, tissue types or developmental stages (4,7-10). 
These activating histone marks tend to be present in 
enhancer elements that are activated and absent when 
they are repressed. Additionally, activated regulatory 
elements are more hkely to be located within the context 
of accessible (open) chromatin where they can be bound by 
transcription factors. The accessibility of specific DNA 
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sequences can be determined by their sensitivity to digestion 
by DNase I, witli open chromatin being highly digested and 
vice versa. Recently, large-scale studies of activating (e.g. 
H3K27ac) histone marks and DNase 1 hypersensitive sites 
(DHSs) such as those from the Encyclopedia of DNA 
Elements (ENCODE) (11,12) have been used in various 
human cell types to predict enhancer elements (9,10). 
Additionally, other high- throughput studies assaying El A 
binding protein p300 and CREB binding protein interaction 
sites have also been used to discover putative enhancers 
(4,13,14). Although these studies can predict enhancer 
elements on a large scale, they suffer from the inability to 
globally identify the target gene promoters of the identified 
enhancer elements. 

Although the mechanism of enhancer-target promoter 
interaction formation is still not well understood, it is 
commonly accepted that enhancers and promoters 
interact with each other through a transcription factor 
protein complex (15). Based on this model, the chromo- 
some conformation capture (3C) approach can be used to 
identify enhancer elements as well as their target genes 
simultaneously by detecting two hnearly independent 
DNA segments that are bound to one another via a 
protein complex. One major drawback of the 3C 
approach is that it requires prior knowledge of the 
putative enhancer and promoter elements to allow design 
of specific PCR primers, which is often unknown. To 
address this limitation, a high-throughput version of 3C 
was developed (Hi-C) (16) to detect genome-wide DNA- 
DNA interaction events. This approach avoids multiple 
PCR steps by ligating interacting DNA elements 
followed by high-throughput sequencing to provide 
unbiased identification of DNA-DNA interacting pairs. 
Several variants have been developed by other groups to 
identify the chromosome organization and regulatory sites 
of the human, yeast and Drosophila melanogaster genomes 
(17-20). However, these original studies focused on deter- 
mining large-scale chromosomal organization, and did not 
demonstrate whether the high-throughput sequencing 
variant of 3C is sensitive or specific enough for prediction 
of enhancer-promoter interactions. 

More recently, Chepelev et at. (21) developed chromatin 
interaction analysis using paired-end tag sequencing 
(ChlA-PET), which is a strategy that combines 3C with 
ChlP-seq, for an enhancer associated histone modification 
(H3K4me2) to identify intra-chromosomal enhancer- 
promoter interactions (22). This led to the successful iden- 
tification of only intra-chromosomal enhancer-promoter 
interactions that were associated with a specific histone 
modification (H3K4me2). Another recent study apphed 
the variant 3C method [carbon-copy chromosome con- 
formation capture (5C)] to identify ~100 enhancers and 
their specific target genes by designing ~6000 primers 
along the ENCODE pilot project regions (23). Although 
none of these previous studies were at the genome-wide 
scale, they have demonstrated that datasets produced by 
the 3C method can be used for genome-wide identification 
of enhancer-target promoter interactions. 

Here, we revisit the original Hi-C experimental data 
with the goal of identifying enhancer-target gene inter- 
actions on a genome-wide scale for humans. To do this, 



we developed a new analysis framework for Hi-C experi- 
ments that integrates multiple genome-wide enhancer- 
defining datasets to identify enhancer-target gene pairs. 
Using this approach, we identified thousands of high-con- 
fidence enhancer-target promoter interactions in two dif- 
ferent human cell types. We vahdated these interaction 
pairs by demonstrating our putative enhancer elements 
are highly correlated with known p300 binding sites, and 
their target gene promoters are enriched in RNA 
Polymerase II (Pol 11) binding. Furthermore, we found 
that the predicted enhancer elements are conserved in 
the mammalian lineage, and their target genes are 
expressed in a highly cell type-specific manner. In total, 
our pipehne has allowed the first robust and genome-wide 
discovery of thousands of novel enhancer-promoter inter- 
actions in the human genome. 

MATERIALS AND METHODS 

Hi-C data from two human cell types 

To comprehensively identify putative enhancer-target 
promoter interactions in the human genome, we first 
downloaded the original genomic alignments for the 
paired-end Hi-C sequencing data from two different 
human cell fines, a lymphoblastoid (GM06990) and a 
chronic myelogenous leukemia (K562) cell fine (GEO ac- 
cession number GSE18199). After an initial analysis, we 
found that the overlap of extended hotspots (defined 
below) between biological replicates of the GM/Hindlll 
sample is 50.6% (P<2.2e-16, chi-square test). 
Furthermore, the sequencing reads from these same 
samples were also combined in the original Hi-C study. 
Therefore, to more comprehensively identify DNA-DNA 
interacting pairs, mapped reads from biological replicates 
of the GM/Hindlll sample were merged to single datasets. 
In total, we looked at the interacting patterns for three 
different sample sets from the original Hi-C study: 
GM06990 cells whh Hindlll (GM/Hindlll), GM06990 
with Ncol (GM/Ncol) and K562 with Hindlll (K562/ 
Hindlll) digestion. 

Identifying statistically significant DNA interacting 
hotspots from Hi-C datasets 

We first identified significant clusters in the Hi-C data 
using a geometric distribution-based model (24). To do 
this, we assembled all mapped reads for a given dataset 
(GM/Hindlll, GM/Ncol or K562/HindIII) into consecu- 
tive contigs (made up of overlapping reads) for each 
nuclear chromosome, without initially considering the 
read pairing information for these libraries. This 
approach allowed us to determine the gap regions 
between the identified contigs. These gap lengths should 
follow a geometric distribution: 

P(X, = k) = (l -Pif-'pi 

P{Xi<k)=\-{\ -pif 

where Z,- and pj are the gap lengths and the probability 
of a position covered by any read on chromosome i. 
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respectively. Accordingly, we fit the gap lengths to a geo- 
metric distribution for each chromosome (Supplementary 
Figure SI) and estimated Pi based on the mean gap length. 
We then grouped contigs into clusters by merging nearby 
contigs based on the gap distances between them. 
Specifically, contigs were merged into significant clusters 
if they are closer to each other than the 5% quantile 
according to the fitted geometric distribution. 

Next, we identified high-confidence DNA interacting 
hotspots by fitting cluster lengths to an additional geomet- 
ric distribution for each nuclear chromosome 
(Supplementary Figure S2), where the X, value is based 
specifically on cluster length and the p,- value is the 
emission probabihty based on the mean cluster length 
calculated for the Hi-C data for chromosome /. Only the 
significant clusters (>99% quantile) identified with this 
second geometric distribution-based test were retained 
and defined as DNA interacting hotspots. It is worth 
noting that we did not take into account the Hi-C inter- 
action data for these hotspots during this analysis step, 
but only looked for interacting partners during our 
analysis to identify those hotspots that are putative 
enhancer elements (see below). We also analyzed DNA 
interacting hotpots identified using the quantiles of 98 
and 99.9%, and the results of these analyses are presented 
in Supplementary Figures S4 and S5, respectively. 

Compensating for restriction enzyme fragmentation bias 
and identifying bona fide DNA interacting hotspots 

In Hi-C, the resulting sequencing reads are enriched for 
regions of the genome near or at the restriction enzyme 
(RE) sites used in the experiment (as also indicated by 
our motif analysis; see corresponding Results section) 
rather than the actual enhancer-target interaction site. 
Thus, the actual interaction site could be at any location 
between two restriction sites that are thousands of base 
pairs apart. To address this bias, we examined the distri- 
bution of the distances between adjacent restriction sites 
within the human genome, which was then fitted to a 
poisson distribution: 

P{X=k) = k''e-^/k\ 

where X is based on the distances between adjacent 
restriction sites, and parameter X is the mean of the 
distances along the genome (Supplementary Figure S3). 
Based on the estimated parameter A, we extended the 
boundaries of DNA interacting hotspots in both direc- 
tions from the midpoint of each hotspot to the length 
of the 95% quantile (~3kb) of the fitted poisson distribu- 
tion. This extension makes it much more hkely that we 
are covering the actual DNA-DNA interaction sites, 
which are linked to the closest RE site by the nature of 
the Hi-C protocol. The resulting regions were referred 
to as extended hotspots. 

Identification of candidate enhancer elements enriched in 
activating histone modifications 

We began selecting for candidate enhancer elements (CEEs) 
by focusing on the extended hotspots that (i) had at least 



one of its interacting regions overlapping a protein-coding 
gene promoter and (ii) the particular CEE-promoter inter- 
action was supported by > 1 paired-end sequencing read in 
the corresponding Hi-C dataset. We then determined the 
overlap (see below for description of enrichment analyses) 
between these promoter-interacting extended hotspots and 
the four activating histone marks (H3K4mel, H3K4me2, 
H3K4me3 and H3K27ac) that are known to be associated 
with enhancer elements in the human genome (4,7-10). We 
also examined the overlap of these promoter-interacting 
extended hotspots with H3K27me3, a heterochromatic 
histone modification (23) that is not enriched at enhancer 
elements. To further select for CEEs that are likely bona 
fide enhancer elements, we only maintained promoter- 
interacting extended hotspots containing known enhancer- 
related histone modifications that are also enriched in 
DHSs. Thus, CEEs are defined as highly confident 
promoter-interacting extended hotspots enriched in known 
enhancer-related histone modifications and DHSs. For 
these enrichment analyses, the histone modification and 
DHS data was downloaded from the UCSC ENCODE pro- 
duction phase (hgl8 assembly) (24,25). It is of note that we 
used the lymphoblastoid cell line (GM 12878) and chronic 
myelogenous leukemia (K562) from the ENCODE project 
in our study, as they are the most closely related cell lines to 
those used in the original Hi-C study (GM06990 and K562). 
As a control, we generated 1000 sets of the same number of 
extended hotspots randomly selected from the human 
genome (random extended hotspots), and used them as a 
background to evaluate the significance of enrichment for 
all subsequent analyses. 

Enrichment analyses 

All enrichment analyses for CEEs and their target pro- 
moters (e.g. p300 binding) were performed by computing 
the enrichment index {ERI) as a ratio of the two 
proportions: 

ERI{A) = C{A)/P{A) 

where A is the set of intervals for a particular histone modi- 
fication or other genomic feature (e.g. DHS, p300 binding 
or Pol II binding) determined using ENCODE ChlP-seq or 
DNase-seq experiments (26). C(A) is the total length in 
base pairs of CEEs (or interacting promoter hotspots if 
we are examining target promoter characteristics) that 
overlap with A, and P(A) is the mean of total lengths 
that overlap with A from 1000 random control sets (see 
above). It is worth noting that in enrichment analyses for 
CEEs, each permuted set is selected randomly from the 
collection of all extended hotspots with the additional con- 
straints that they must have similar chromosomal and 
length distributions as the set of CEEs being analyzed 
(28). For enrichment analyses of CEE target promoters, 
each control set is selected randomly from the promoter 
regions of the 21 522 non-redundant protein-coding genes 
in the hgl8 assembly. Thus, a high ERI for a set of CEEs or 
their target promoters indicates that they tend to be over- 
lapping with a particular histone modification or binding 
feature when compared with all extended hotspots or non- 
redundant protein-coding gene promoters, respectively. 
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Characterizing p300 binding to CEEs and RNA Pol II 
binding to tbeir target promoters 

We downloaded the previously identified p300 and RNA 
Pol II binding sites from the UCSC ENCODE database 
for GM12878 and K562 cell lines (hgl8 assembly) 
(25,26,29). We then calculated the ERI for p300 binding 
within CEEs as well as RNA Pol II binding to CEE inter- 
acting promoters as described above. 

Determining cell type-specific expression of CEE 
target genes 

We downloaded the previously published gene expression 
profiles for the nine ENCODE human cefi lines 
(GSE26312) (30). Data were normalized by RMA 
(31-33) and log2-transformed. We aggregated probeset- 
level to gene-level expression values for each cell fine as 
fofiows. For each probeset, we computed the average 
expression level across replicates. For each gene, we then 
computed the average expression across multiple probesets 
(if applicable). The gene expression profiles of the nine 
ENCODE cell lines were combined into a common gene 
set (13436 genes), and between sample expression values 
were normalized again to eliminate any array-specific bias 
using quantile normalization (normalize. quantiles function 
in R/affy package) (31). To determine if a gene has strong 
tissue-specific expression in either GM 12878 or K562 cells 
compared with the other seven cell types, we used an 
entropy-based metric (34) as follows. For each gene g, we 
computed p,.ig as the expression level in cell type c divided 
by the sum of expression levels across ah nine cell fines. The 
entropy (35) for g is defined as Hg = —J2i<csN Pi\g 
log2(Pc\g), where = 9 is the total number of cell types 
in this study. H^, ranges between 0 (gene g is expressed in 
only one ceU type) and log2(N) (gene g is expressed uni- 
formly in all cell types). To measure the specificity for a 
particular cell type c, we computed Qg^,. = Hg — logifpdg)- 
The quantity —log2( Pc\g) has a range between 0 (when gene 
g is only expressed in cell type c) and infinity (when gene g is 
not expressed in ceU type c). 

Sequence motifs in CEEs 

We examined the sequence motifs of the CEEs using the 
HOMER software package (36), and only considered 8, 10 
and 12 bp for the motif length in each sample. We used 
all extended hotspots as the background when searching 
for overrepresented motifs (-bg parameter in HOMER) 
in an effort to reduce potential biases introduced toward 
restriction sites owing to the original Hi-C protocol. 
Significance levels were set as P < 0.05. 

RESULTS 

Identifying candidate DNA interacting sites: hotspots and 
extended hotspots 

We built an analysis workflow that extracts high-quahty 
DNA interacting sites from Hi-C datasets. Figure 1 shows 
the overall workflow for identifying these DNA interact- 
ing hotspots, which we analyzed further to identify 
putative enhancer elements and their promoter partners. 



All three samples from the original Hi-C study (16) were 
used in our analyses [cell line GM06990 with REs Hindlll 
and Ncol, as wefl as cell line K562 with Hindlll (referred 
to as GM/Hindlll, GM/Ncol and K562/HindIII, respect- 
ively)]. The original Hi-C study used a 1 Mb window size 
to uncover the 3D organization of human nuclear 
chromosomes. However, this resolution is far too coarse 
for studying regulatory elements, which requires single 
nucleotide resolution. To improve resolution for our 
purposes of identifying DNA interacting hotspots, we 
applied our genomic distribution-based analysis for iden- 
tification of these specific genomic regions (24). Briefly, 
our algorithm first identifies clusters of Hi-C reads that 
are closer to each other than what the background geo- 
metric distribution dictates. We then labeled the resulting 
clusters as hotspots if their lengths on the chromosomes 
are longer than 99% of all clusters. We found that a 
hotspot is on average ~lkb in length, and between 
107 059 and 185 042 total hotspots were identified in 
each of the three samples. 

The Hi-C method dictates that sequencing reads will 
start at or near the sites of the RE used in the experiment 
rather than the actual DNA-DNA interaction site. 
Therefore, the resolution of this method is limited to the 
distance between the genomic sites of the particular RE 
used for that study (Figure 2). To account for this short- 
coming, we extended the length of the originally identified 
DNA interacting hotspots based on the estimated length 
between RE site positions on each human nuclear 
chromosome, while also allowing each nucleotide of an 
extended hotspot to represent the true interaction site. 
We found that on average an extended hotspot is 
3-3.3 kb long (Table 1), indicating that our resolution 
has improved ~300-fold compared with the 1 Mb 
window size used in the original study. 

Characterization of DNA interacting extended hotspots 

We classified aU extended hotspots based on human 
genome annotations and found that many of them are 
located within protein-coding genes, functional RNAs 
and tandem repeats, suggesting that some of the inter- 
action hotspots may be involved in regulatory processes 
(Figure 3a-c). Interestingly, we observed that extended 
hotspots were located within 5-20% of total promoter 
regions (defined as the 500 bp upstream of protein-coding 
gene transcription start sites) of the human genome. This 
led us to speculate that some of the extended hotspots 
from our reanalysis of Hi-C data may actually reflect 
target promoters that are interacting with enhancer 
elements in the human genome. 

Prediction of CEEs 

To identify CEEs, we first considered extended hotspots 
that interact with a protein-coding gene promoter 
region(s) (defined as the 500 bp upstream of the annotated 
transcription start site). As shown in Table 2, 22-62% of 
the extended hotspots interact with a protein-coding gene 
promoter. The variation in promoter interactions is Hkely 
a consequence of the number of promoters that are 
covered by extended hotspots, which is influenced by 



Nucleic Acids Research, 2013, Vol. 41, No. 9 4839 



both the total sequencing depth in a particular sequencing 
library and the REs and cell types used in the Hi-C 
experiments. We next examined the enrichment of 
promoter-interacting extended hotspots in four activating 
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Figure 1. Genome- wide enhancer element identification workflow. 



histone modifications known to be associated with 
enhancer elements (H3K27ac, H3K4mel, H3K4me2 and 
H3K4me3), and a heterochromatic histone modification 
(H3K27me3) as a negative control (25,29). As expected, 
we found that promoter-interacting extended hotspots are 
enriched (permutation test, P< 0.001) in all four 
activating histone modifications but not with H3K27me3 
(Figure 4a) when compared with the random background 
control. These results suggest that many of the promoter- 
interacting extended hotspots are human enhancer 
elements. 

To further improve our confidence that we are detecting 
bona fide enhancer-target gene promoter interactions, we 
added an additional quality control step where we only 
retain promoter-interacting extended hotspots if their 
promoter interaction is supported by more than one 
read (« > 1) in the sequencing results (Supplementary 
Figure S6). This filtering step dramatically reduced the 
number of potential enhancer elements in all three 
samples. In fact, only 7.7-12.2% of the promoter- 
interacting extended hotspots were retained as potential 
enhancer elements (Table 2). This step hkely reduced the 
number of false positives in our dataset, as we found it 
substantially increased the enrichment in the four 
enhancer-associated activating histone modifications 
(H3K27ac, H3K4mel, H3K4me2 and H3K4me3) in 
the remaining promoter-interacting extended hotspots 
(Figure 4b). Taken together, these results indicate that 
increased read support for the promoter-extended 
hotspot interactions is necessary for high-confidence iden- 
tification putative enhancer elements and their targets 
from Hi-C experimental data. 

The final filtering step in our pipehne to identify CEEs 
was to determine the enrichment of DHSs within the 
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Figure 2. Identification of potential enhancer elements by our novel analysis pipeline requires an extension of regions that have been termed DNA 
interacting hotspots from the 3C data as depicted. 
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Table 1. Characterization of extended hotspots 



Samples 


GM/Hindlll 


GM/Ncol 


K562/HindIII 


# Raw reads (paired spots) 


30 009111 


28 659 279 


36 823 509 


# Unique mapped pairs 


18 728 707 


18 891 283 


21 744 849 


Percentage of mapped to raw 


62% 


66% 


59% 


# Unique mapped single-end 


37 457414 


37 782 566 


43 489 698 


# Clusters (merged by gap length) 


4973 281 


5076 539 


6 247 694 


Average cluster length (bp) 


172.4 


168.9 


160.2 


# Hotspots 


107 059 


166 990 


185042 


Average hotspot length (bp) 


1047.9 


1007.8 


964.1 


# Extended hotspots 


96 800 


137611 


150611 


Average extended hotspot length (bp) 


3065.8 


3349.5 


3282.1 



GM = GM06990. 
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Figure 3. Functional annotation of extended hotspots for sample (a) GM/Hindlll, (b) GM/NcoI and (c) K562/HindIII. Each bar (as labeled) 
represents the percent of total length for each genomic feature that overlaps with extended hotspots. 



Table 2. Number of CEEs present after each filtering step 



Filtering step 


GM/ 


GM/ 


K562/ 




Hindlll 


Ncol 


Hindlll 


Promoter partners 


22818 


90 200 


93 109 


Strong interactions 


1757 


11001 


9955 


(>1 read) 








Activating histone 


928 


5617 


5814 


mark enrichment 








DNase I HS sites 


823 


4809 


5033 



subset of high-confidence promoter-interacting extended 
hotspots (supported by >1 sequencing read) using previ- 
ously pubhshed datasets (37,38). From this analysis, we 
found that the set of high-confidence promoter-interacting 
extended hotspots from all three original Hi-C experi- 
ments were enriched (/"< 0.001) in DHSs (Figure 4c). 
The tendency of high-confidence promoter-interacting 
extended hotspots to co-localize with DHSs provides 
further evidence of the rehabihty of our analysis strategy 
to identify bona fide enhancer element-target promoter 
pairs in the human genome. In summary, the combination 



of these results has led us to incorporate all three of these 
analysis steps in our pipeline for genome-wide prediction 
of CEEs and their interacting target promoters in the 
human genome (see Supplementary Dataset 1 for the 
entire hst). 

CEEs and their target genes are enriched in binding 
activities associated with gene expression 

To provide further evidence that our CEEs are bona 
fide enhancer elements, we examined the enrichment of 
p300 binding within these regions. We focused on p300 
because it is a known enhancer-associated co-activator 
that mediates the regulation of target gene expression 
(39,40). We found that the CEEs from all three Hi-C 
experiments were enriched (P< 0.001) in p300 binding 
compared with a background control of all extended 
hotspots (Figure 5a). This enrichment in p300 binding 
within CEEs strongly suggests we have identified bona 
fide enhancers, and by using the Hi-C data in this 
analysis we also identify the gene promoter(s) that each 
element can target. 

Enhancer elements generally activate gene expression 
through direct interaction with target promoters (40^2) 
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Figure 4. Potential enhancer elements are enriched for activating histone marks and DHSs. (a and b) Fold enrichment for activating (H3K27ac and 
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Figure 5. Potential enhancer elements are enriched for p300 binding, 
and their target genes are highly bound by Pol II. (a) P300 binding site 
enrichment in CEEs. (b) Pol II enrichment observed for the genes 
targeted by CEEs. Dashed line is the expected value based on a 
genomic control. 



that results in increased RNA Pol II association. 
Therefore, we tested whether the CEE target gene pro- 
moters were enriched for Pol II binding. From this 
analysis, we found that CEE target promoter regions are 
>20% enriched (P< 0.001) for Pol II binding when 
compared with the promoters of all other protein-coding 
genes. These results suggest that transcription initiation is 
increased at promoters that are in contact with the CEEs 
compared with all other gene promoters in the human 
genome. 

To further test if transcription is generally higher 
from target genes of the CEEs, we also investigated 
the enrichment of Pol 11 Ser2 phosphorylation, which 
marks elongating Pol II, within these loci for the K562 
dataset using previously published data (GSM935547). 
Interestingly, we observed a 1.47-fold enrichment 



(/■< 0.001) in Pol II Ser2 phosphorylation within target 
genes of our CEEs compared with aU other protein-coding 
genes. These results indicate that the CEE-target gene 
interaction not only increases Pol 11 promoter binding, 
but also effects transcription elongation. 

In summary, the consistent enrichment of the CEEs in 
p300 binding as well as their target genes with initiating 
and elongating Pol 11 strongly suggests that we have 
identified bona fide enhancer-target gene pairs by 
reanalyzing the previous Hi-C results. We also compared 
our CEEs with the enhancers predicted in the recently 
pubhshed study using 5C with primers designed to the 
ENCODE pilot project regions covering only ~1% of 
the human genome (23). We found that our CEEs 
overlap significantly with these enhancer elements 
compared with all extended hotspots lying within the 
ENCODE pilot region as a background control 
(Table 3). Thus, our method provides an important im- 
provement over previous approaches for identifying 
human enhancer elements because we not only identify 
enhancers, but we also uncover their specific regulatory 
targets on a genome- wide scale. 



Characterization of CEE-target gene interactions 

We found that CEEs interact with 1.17-1.62 target gene 
promoters on average (Table 4), which is consistent with 
recent results (23) and suggest that human enhancer 
elements can interact pleiotropically. Additionally, we 
found that most target gene promoters interacted with 
multiple (1.17-2.36) CEEs (Table 4), suggesting the exist- 
ence of enhancer redundancy in the human genome. 
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We also determined the interaction characteristics of 
CEEs and their target genes. We found that the vast 
majority of these interactions are intra-chromosomal (on 
the same chromosome), while fewer than 13% are 
inter-chromosomal (Figure 6a) with httle read support 
for these latter associations (Supplementary Table SI). 
Interestingly, we found that >95% of the intra- 
chromosomal interactions occur within a range of 1 Mb 
(Figure 6b). In total, these results indicate that the 
majority of the CEEs that we have identified from the 
Hi-C data are in relatively close proximity to their target 
promoters. 

CEEs and target genes are enriched in enhancer- 
associated motifs 

To identify specific sequence motifs in the CEEs and their 
target promoters, we further searched for overrepresented 
sequences using HOMER (36). Not surprisingly, a quick 
search using a random genomic background yielded the 
recognition site of Hindlll (AAGCTT) as top motif in the 
CEEs identified in by the original GM/Hindlll Hi-C ex- 
periment (Table 5). These results suggest that as expected 
the Hi-C experimental approach identifies DNA inter- 
action sites that are localized near restriction sites in the 
human genome (Figure 2). To minimize this bias for RE 
sites, we performed the motif searches with a background 
of all extended hotspots. As a result, we identified 38, 54 
and 39 motifs from each experiment, including the binding 
motifs of known enhancer-associated transcription factor 
families such as Spl, NRFl, E2F, GATA and ETS 
(Supplementary Dataset S2 and Table 6). Remarkably, 
we found that in all three CEE datasets, there was signifi- 
cant enrichment for the binding sequence of the E26 
transformation-specific (ETS) family binding 



Table 3. Comparison of the CEEs predicted using Hi-C and 
enhancer predictions in 5C (27) 





Hi-C 


5C 




P value 


Samples 


# CEEs 


# Enhancers 


# Intersects 


of intersects 


GM^/Hindlll 


19 


87 


1 


0.1316 


GM^/Ncol 


37 


87" 


5 


0.0001 


K562/HindIII 


137 


119 


9 


<0.0001 



# CEEs shows the number of CEEs that is overlapped with the 5C 
primers along the ENCODE pilot regions. 

P-value is calculated by permutation tests using the extended hotspots 

that overlap the ENCODE primer sets as background. 

"5C study uses GM 12878; Hi-C study uses GM06990. 

''5C study uses only Hindlll as the RE; here we are comparing using 

the GM/Hindlll dataset. 



domain-containing proteins. These proteins act as tran- 
scription factors that bind to specific enhancers and pro- 
moters, and facilitate the assembly of transcription 
machinery to initiate gene expression (43,44). Thus, our 
CEEs are enriched in sequences known to bind enhancer- 
specific proteins. 

CEEs are conserved within vertebrates 

Functional elements are often under evolutionary selec- 
tion because of their cellular function(s) (45). To study if 
the CEEs are under evolutionary selection, we investigated 
the conservation score in these elements across the mam- 
mahan clade [cons44way conservation score (46)] 
compared with their upstream and downstream flanking 
sequences. We found that the CEEs tend to be more 
conserved than their flanking regions (P < 0.05 for all 
datasets) (Figure 7a). In total, these results revealed that 
the CEEs that we have identified are under purifying se- 
lection in the human genome, suggesting that they are 
functional enhancer elements. 

CEE target genes tend to display tissue-specific expression 

Enhancer elements are known to function in a cell type- 
specific manner (10), so the expression profiles of their 
target genes are likely to display a similar pattern. To 
determine whether genes targeted by CEEs exhibit this 
cell type-specific tendency, we computed the Q statistic 
(34) for every human gene expressed in nine ENCODE 
cell types (see Methods for descriptions), and then 
compared CEE target genes with all other loci in the cell 
types (GM 12878 or K562) most closely corresponding to 
those used in the original Hi-C experiment (GM06990 
or K562). We found that genes targeted by the CEEs 
have significantly lower Q values, indicating that these 
loci are expressed in a cell type-specific manner. This is 
true for CEEs identified using all three Hi-C experiments 
(P = 0.01, 2.11e-05, 4.24e-16 for GM/Hindlll, GM/Ncol 
and K562/HindIII, respectively). In total, all of our results 
suggest that we have identified thousands of bona fide 
enhancer-target gene interactions. A significant amount 
of future attention can now be focused on determining 
the biological functions and significance of these newly 
identified interactions in human cells. 



DISCUSSION 

The original Hi-C article suggested that this experimental 
approach could identify regulatory elements with better 
sequencing throughput, although this analysis was never 
performed. In this study, we show that with careful 



Table 4. Characteristics of enhancer-target interactions 



Samples 


# CEEs 


Average CEE 
interactions 


# of target 
proinoters 


Average target 
promoter interactions 


# of enhancer-promoter 
interactions 


GM/Hindlll 


823 


1.17 


820 


1.17 


953 


GM/NcoI 


4809 


1.62 


3444 


2.27 


7757 


K562/HindIII 


5033 


1.42 


3032 


2.36 


7104 
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Figure 6. Characterization of CEE-target gene interaction distance, (a) The portion of inter- and intra-chromosomal CEE-target interactions for the 
three different Hi-C samples as denoted, (b) The distance distribution of intra-chromosomal CEE-target interactions. 



Table 5. Top 3 most enriched motifs for all CEEs using the whole-genome as the background sequence in the K562/ 
Hindlll library 

Motif P-value % of % of 

Targets Background 

Xa-T^^ CTT"^ <le-300 87.9 55.2 

TTgCAA^C ^^^^^ 

CATCi A C_TCA <ie 228 602 556 



Table 6. Top 10 most enriched motifs in CEEs from the GM/NcoI library 


using extended hotspots 


as the background 




Transcription factor 
(DNA binding domain) 


Motif 


P-value 


% of Targets 


% of 

Background 


Spl(Zf) 


i CQCC CCQC? 


le-134 


38.5 


23.6 


NRFl(NRF) 


gl^CuCaiGCQC 


le-lU 


15.0 


6.4 


ETS(ETS) 


§iCCGGAAGT 


le-69 


41.6 


30.3 


ELFl(ETS) 


igccjGAAGI 


le-64 


58.2 


46.8 


GFY-Staf 


^ACTACAOTCCCAS^ASQC 


le-54 


7.5 


3.1 


NRFl 


c cat c c 


le-51 


19.2 


12.0 


YYl(Zf) 


9AAiATGGCGGC 


le^5 


9.1 


4.61 


E2F 


8SQGCGGGAA 


le^5 


30.2 


22.0 


GEY 


ACIiCA^IICCC 


le-38 


10.0 


5.6 


GABPA(ETS) 


iSOSGGAAgl 


le-32 


77.3 


70.1 
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Figure 7. Potential entiancer elements are evolutionarily conserved, and their target genes are expressed in a cell type-specific manner, (a) The 
conservation score of CEEs (black bars) compared with similarly-sized flanking regions (gray bars) from the three different Hi-C experiments 
(as specified), (b) The Q statistic values for CEE target (black bars) compared with non-target (gray bars) genes from the three different 
Hi-C experiments (as specified). Error bars indicate s.e.m. Differences are statistically significant (*/'<0.05, **P<O.Q\ and ***/"< 0.001, 
Wilcoxon rank-smn test). 



analyses and comprehensive integration of publicly avail- 
able functional genomic datasets, Hi-C data can be used 
to comprehensively identify enhancer-target gene inter- 
actions genome-wide. We first used a geometric distribu- 
tion model to identify DNA interacting hotspots instead 
of using a sliding window to probe 1 Mb segments of the 
human genome. This change in analysis methods signifi- 
cantly improved our genomic resolution ('^3.3 kb or 
300-fold improvement). This increase in resolution is ne- 
cessary for identifying the actual sequences of regulatory 
elements in the human genome. From this initial list of 
DNA interacting hotspots, we focused on intergenic 
sequences that interact with protein-coding gene pro- 
moters, and found these elements overlap significantly 
with enhancer-associated chromatin marks such as 
H3K27ac and H3K4mel that have been previously used 
to identify enhancer elements (9,10). Interestingly, a recent 
study by Chepelev et al. used this property by combining 
Hi-C with H3K4me2 immunoprecipitation to identify 
enhancer-promoter interactions (22). However, this 
study focused solely on cis- interactions and did not 
examine other enhancer-associated epigenetic marks. 
Here, we used multiple chromatin marks as weU as DHS 
datasets to identify thousands of CEEs in two human ceU 
types with high confidence. We also uncovered that not all 
epigenetic marks are equal for these purposes. Specifically, 
we found that all four activating histone marks are 
enriched on the putative enhancers, but they demonstrate 
distinct levels of enrichment (Figure 4). In total, our 
analysis pipeline incorporates data for multiple histone 
modifications and DHSs, which increases confidence that 
bona fide enhancer elements are truly being identified. 

In addition, our analysis is unique when compared with 
three other studies that were recently published. 
Specifically, Lan et al. (47) integrated histone modification 
data that overlapped sites enriched with reads from Hi-C 
experiments for the K562 ceU fine, and found 12 clusters 
of Hi-C sites with different combinations of histone modi- 
fications. However, their analyses were limited only to 
these overlapping regions and did not also interrogate 



aU of the other relevant datasets as we have done here. 
Furthermore, their study only examined enhancer- 
promoter interactions on a specific subset of the human 
genome (GATA1/GATA2 target genes). In another recent 
study, 5C experiments were performed to study enhancer- 
promoter interactions. However, they focused entirely on 
the 44 ENCODE pilot genomic regions instead of per- 
forming a genome-wide analysis (23). This is because a 
global study of enhancer-promoter interactions is not 
feasible with the 5C protocol, as this approach requires 
the design of specific primers for a select group of targeted 
regions. ChlA-PET, another recently developed protocol 
that detects chromosomal interactions using high- 
throughput sequencing (22) was also used to study 
enhancer-promoter pairs. However, as pointed out by 
the developers of this method, their approach is different 
from unbiased approaches hke Hi-C because it requires an 
antibody to a specific histone modification, protein, etc. 
Thus, this method wiU not detect any enhancers that are 
not in close proximity to the histone modification, protein, 
etc. being immunoprecipitated. 

Our analyses revealed that unannotated long-range and 
inter-chromosomal enhancer-target gene interactions can 
be detected using Hi-C data. This is in strong contrast to 
previous studies of short-range enhancer-target gene 
interactions, namely predicting cw-targeted genes within 
a small fixed window (13) or by defining a variable but 
local transcriptional domain (48) around the identified 
enhancer elements. We found inter-chromosomal inter- 
action to be much less frequent than both cis and trans 
intra-chromosomal interactions (Figures 6a and b). This 
may be because the inter-chromosomal and long-range 
interactions are underestimated due to the limited 
sequencing depth of the initial Hi-C experiments, or to 
these being less stable and/or transient interactions. 
Thus, we may identify more of these interactions with 
future Hi-C experiments with much greater sequencing 
depth. 

We have also uncovered both one-to-one and 
multiple-to-multiple CEE-target interactions (Table 4). 
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These results reveal the extreme complexity of enhancer- 
target promoter relationships in the human genome. 
Interestingly, genes targeted by the same enhancer 
element could be co-regulated, competing or activated in 
different developmental stages or tissue types. Similarly, 
enhancer elements that target the same gene could also be 
cooperative or competing in maintaining gene expression 
homeostasis or for altering expression activities of the 
target gene. Comprehensive time-course studies with 
high read coverage (for better sensitivity) will be necessary 
to further elucidate the regulatory mechanisms behind 
each enhancer-target promoter interaction. 

The Hi-C protocol has an inherent hmitation for 
enhancer discovery as we have described (Figure 2). 
Specifically, we have revealed that the data from the 
Hi-C protocol actually detects RE sites around the bona 
fide DNA-DNA interaction regions. While increasing the 
read coverage is still essential for obtaining high-quahty 
enhancer-promoter interaction data, it does not solve this 
particular limitation. To accommodate this lapse in reso- 
lution, we had to extend the identified DNA interacting 
hotspots in both directions, as we could not predict which 
direction to extend based on the Hi-C sequencing data 
alone. Thus, our analysis workflow is the first to allow 
confident prediction of enhancer-target promoter inter- 
actions from Hi-C data, and provides the framework for 
future studies that will use this approach for these same 
purposes. 

Applying our analysis workflow to identify DNA inter- 
action information from Hi-C, aUows identification of 
candidate enhancers and their associated target genes. In 
the future, comparing datasets similar to the ones 
provided here with findings from GWAS and eQTL 
studies is likely to provide mechanistic insights into how 
many intergenic SNPs can be associated with a certain 
disease. In total, a comprehensive list of enhancer- 
promoter interactions is hkely to significantly improve 
the resources available to future genetic studies focused 
on human disease. 
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